fix: kill ghost agent processes on Ctrl+C#27
Conversation
Add _kill_process_group() and _SESSION_KWARGS to support killing agent processes and all their children when the ralph loop is cancelled or times out. This is the foundation for fixing computerlovetech#26. Co-authored-by: Ralphify <noreply@ralphify.co>
Use start_new_session=True in _run_agent_streaming so the agent and its children form a separate process group. Replace proc.kill() calls with _kill_process_group() for proper group-wide cleanup on timeout and in the finally block. Also harden _kill_process_group with a pgid==pid guard and SIGTERM-before-SIGKILL strategy for safer WSL behavior. Co-authored-by: Ralphify <noreply@ralphify.co>
Use subprocess.Popen with start_new_session=True in the blocking path so agent subprocesses form their own process group. On timeout or KeyboardInterrupt, the entire group is killed via _kill_process_group, preventing ghost processes from surviving Ctrl+C. Co-authored-by: Ralphify <noreply@ralphify.co>
Cover _kill_process_group behavior (SIGTERM/SIGKILL escalation, session leader guard, fallbacks) and verify both streaming and blocking modes use start_new_session and call _kill_process_group on timeout/interrupt. Co-authored-by: Ralphify <noreply@ralphify.co>
Remove redundant _kill_process_group integration tests that duplicated unit test coverage, merge similar fallback tests, and trim helper docstrings. Reduces test additions by ~100 lines without losing coverage. Co-authored-by: Ralphify <noreply@ralphify.co>
Co-authored-by: Ralphify <noreply@ralphify.co>
Merge TestKillProcessGroup and TestProcessGroupIsolation into a single TestProcessGroupCleanup class with a class-level pytestmark for the win32 skipif. Remove redundant docstrings and unused os import. Co-authored-by: Ralphify <noreply@ralphify.co>
kasperjunge
left a comment
There was a problem hiding this comment.
Thanks for tackling this — ghost processes are a real problem and the process group isolation approach is solid.
However, I'm concerned about the 3-second SIGTERM→SIGKILL behavior as the default Ctrl+C. If the agent is mid-way through writing files or doing a multi-file refactor, a hard kill leaves the working directory in a corrupted state. That's potentially worse than the ghost process.
What I'd like instead is a two-stage Ctrl+C:
- First Ctrl+C — Signal the loop to stop after the current iteration finishes naturally. The agent completes its work, no orphaned process, no interrupted writes. Show a clear UI message like
"Finishing current iteration… (Ctrl+C again to force kill)". - Second Ctrl+C — Hard kill with the SIGTERM→SIGKILL escalation from this PR. For when the user truly wants to abort immediately and accepts the risk.
The engine already has stop_requested on RunState, so the first Ctrl+C path may mostly exist. The process group cleanup from this PR becomes the escalation path for the second Ctrl+C.
Could you rework the PR with this two-stage approach? The UX messaging is important — the user should clearly understand what each Ctrl+C does.
Summary
Fixes #26 — agent subprocesses now reliably terminate on Ctrl+C, timeout, and cancellation.
_kill_process_grouphelper that sends SIGTERM with 3s grace period before SIGKILLstart_new_session=Trueto isolate agent subprocesses in their own process groupos.killpg()to only fire when the child is actually a session leaderTest plan
start_new_session=True)